Random Forest Regression Trading Model

Intent:

This modeling process is meant to test techniques for approximating the target price of multiple S&P 500 stocks at some point in the future. The intent is to include technical indicators, which traders use to uncover trends that may signal future stock performance.

Hypothesis:

Given that traders frequently base purchasing decisions on technical indicators (statistics about the movement of stock prices and trading activity), I hypothesize that these decisions can be modeled using advanced modeling techniques and, ideally, identified before the actual price movement happens. Automating this via machine learning should allow us to trade mechanically, faster than a human using manual techniques.

Part 1: Feature Selection

In this section, we train a lightly configured Random Forest Regressor to get the importance of every feature. We then rank the features by importance and keep only those above a certain importance threshold.
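The ranking-and-threshold step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the feature names (`sma_50`, `rsi_14`, `noise_a`, `noise_b`), the synthetic data, the `select_features` helper, and the 0.05 threshold are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def select_features(X, y, names, threshold=0.05, seed=0):
    """Fit a lightly configured forest and keep features above `threshold`."""
    model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=seed)
    model.fit(X, y)
    # Pair each name with its impurity-based importance, highest first.
    ranked = sorted(zip(names, model.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    return [(name, float(score)) for name, score in ranked if score >= threshold]


# Synthetic stand-in data (hypothetical): only the first two columns drive y.
rng = np.random.default_rng(0)
names = ["sma_50", "rsi_14", "noise_a", "noise_b"]
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

selected = select_features(X, y, names, threshold=0.05)
for name, score in selected:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances always sum to 1, so the threshold reads as "share of total importance". With the highly correlated features noted later in this write-up, importance can be split across near-duplicates, so the threshold may need tuning.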

Part 2: Exploring Target Variables

We are going to explore the model's prediction capabilities by training four models, each tasked with predicting the future price over a different time horizon (7, 30, 60, or 120 days). The predictions will tell us whether we should be buying or selling to take advantage of price changes.
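One way to build the four target variables is to pair each day with the price some number of rows later. The sketch below assumes "days" means rows in the price series (trading days rather than calendar days); the `future_price_targets` helper and the price values are made up for illustration.

```python
HORIZONS = (7, 30, 60, 120)


def future_price_targets(prices, horizons=HORIZONS):
    """For each horizon h, pair day i with the price h rows later.

    Rows near the end have no future price yet, so they get None and
    should be dropped before training.
    """
    targets = {}
    for h in horizons:
        targets[h] = [prices[i + h] if i + h < len(prices) else None
                      for i in range(len(prices))]
    return targets


# Made-up closing prices, one per trading day.
prices = [100.0 + 0.5 * day for day in range(130)]
targets = future_price_targets(prices)

usable_rows = {h: sum(t is not None for t in targets[h]) for h in HORIZONS}
print(usable_rows)
```

Note that longer horizons leave fewer usable training rows, since the last h rows of each series have no known target.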

Observations:

  1. Looking at prediction performance so far, our model does a horrible job of predicting the future

Adding models with longer target horizons

Quick Observations for Correlation:

  1. Fairly high correlation between all features, but that's expected as features are all calculated based on some price or volume change behavior
  2. We could use a smaller subset of manually determined, less correlated features. Or, we could manually engineer new features based on commonly used technical indicators, such as the "death cross", where the short-term moving average (typically 50 days) falls below the long-term moving average (typically 200 days)
  3. Other methods are out of scope for this work; as you will see later, our model actually performs very well
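As an illustration of the engineered-feature idea, a "death cross" flag could be computed roughly like this. It is a sketch only: the helper names are hypothetical, simple (unweighted) moving averages are assumed, and the tiny windows in the example stand in for the usual 50 and 200 days.

```python
def moving_average(prices, window):
    """Trailing simple moving average; None until `window` prices exist."""
    out = []
    running = 0.0
    for i, price in enumerate(prices):
        running += price
        if i >= window:
            running -= prices[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out


def death_cross_flags(prices, short=50, long=200):
    """1 on days where the short MA moves from at-or-above to below the long MA."""
    short_ma = moving_average(prices, short)
    long_ma = moving_average(prices, long)
    flags = [0] * len(prices)
    for i in range(1, len(prices)):
        if None in (short_ma[i], long_ma[i], short_ma[i - 1], long_ma[i - 1]):
            continue  # not enough history for both averages yet
        was_above = short_ma[i - 1] >= long_ma[i - 1]
        is_below = short_ma[i] < long_ma[i]
        flags[i] = 1 if was_above and is_below else 0
    return flags


# Made-up prices that rise and then fall; small windows keep the example short.
prices = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
flags = death_cross_flags(prices, short=2, long=4)
print(flags)
```

A single binary flag like this carries different information from the raw moving averages, which is why it may be less correlated with the existing features.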

Observations:

  1. Every model does a fairly good job of predicting the first data point forecasted in the future, but performance drops off significantly from there
  2. Also, our train predictions all have R^2 greater than 99%, which is common for Random Forest models, as they have a tendency to overfit
  3. Interestingly enough, the models vary in which features they consider important, and the 7 day model has a wider distribution of technical features with similar importance

Part 2b: Perform GridSearch to Find Best Model Configuration

GridSearch will be performed to look for the optimal model configuration. This can also help limit compute time by allowing you to find a less compute-intensive model upfront (e.g. n_estimators = 500 takes substantially more time to train than n_estimators = 100).
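A minimal sketch of such a search with scikit-learn's `GridSearchCV` is shown below. The parameter grid, the synthetic data, and the choice of `TimeSeriesSplit` for cross-validation are assumptions for illustration; the notebook's actual grid and splitting scheme are not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in data (hypothetical): one informative feature out of three.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Deliberately small grid so the example runs quickly.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 6],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=TimeSeriesSplit(n_splits=3),   # assumed: folds that respect time order
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
# cv_results_ also records fit times, which helps weigh accuracy against compute.
fit_times = dict(zip(map(str, search.cv_results_["params"]),
                     search.cv_results_["mean_fit_time"]))
print(best_params)
```

Comparing `mean_fit_time` against the scores is one way to spot a cheaper configuration that performs about as well as the most expensive one.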

Creating Dataframes of model metrics for AAPL

Part 3: Using past knowledge to re-train our model

Given that the first point we predict is consistently much better aligned with the actual test data, we will explore an approach that re-trains the model every day as additional information becomes available.
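The daily re-training idea (often called walk-forward validation) can be sketched as below. This is an illustration under stated assumptions: the trending price series is synthetic, the only feature is the previous day's price, and the helper names `fit_forest` and `walk_forward_predictions` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_forest(X, y, seed=0):
    """Small forest so that repeated re-fitting stays cheap."""
    return RandomForestRegressor(n_estimators=20, random_state=seed).fit(X, y)


def walk_forward_predictions(X, y, first_test_index):
    """Re-fit on all rows before day t, then predict only day t."""
    preds = []
    for t in range(first_test_index, len(y)):
        model = fit_forest(X[:t], y[:t])   # only information available before t
        preds.append(float(model.predict(X[t:t + 1])[0]))
    return preds


# Made-up upward-trending prices with a little noise.
rng = np.random.default_rng(0)
prices = np.arange(61, dtype=float) + rng.normal(scale=0.2, size=61)
X = prices[:-1].reshape(-1, 1)   # feature: yesterday's price
y = prices[1:]                   # target: today's price

split = 50
static_preds = fit_forest(X[:split], y[:split]).predict(X[split:])
daily_preds = walk_forward_predictions(X, y, split)

static_mae = float(np.mean(np.abs(static_preds - y[split:])))
daily_mae = float(np.mean(np.abs(np.array(daily_preds) - y[split:])))
print(f"fit once: {static_mae:.2f}  re-fit daily: {daily_mae:.2f}")
```

A Random Forest cannot predict values outside the range it was trained on, so a model fit once falls further behind a trending series each day, while the re-fit model only ever has to look one step ahead.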

Observations:

  1. As expected, re-training our model daily provides much better accuracy for making predictions into the future, as we are able to incorporate new information for each model build
  2. I believe this is true given we are able to minimize the effects of the "random walk" theory commonly attributed to stock market price movements. We are also better able to minimize the effects of exogenous or economic factors that actually lead to more concrete changes in stock prices

Part 4: Picking a subset of tickers and creating models for each

In this section, we will make assumptions about tickers and their performance (based on research/tribal knowledge) and build models for each, allowing us to start creating a diversified portfolio. We will then iteratively train models for each ticker over different prediction time horizons and portfolio rebalance cycles to compare performance.
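To make the interaction between predictions and rebalance cycles concrete, here is one possible allocation rule. It is purely hypothetical and not the rule used in the notebook: on each rebalance day, every ticker whose predicted price is above its current price gets an equal share of the portfolio. The tickers `AAA`, `BBB`, and `CCC` and their prices are invented.

```python
def target_weights(current_prices, predicted_prices):
    """Equal-weight every ticker whose predicted price exceeds today's price."""
    expected = {ticker: predicted_prices[ticker] / current_prices[ticker] - 1.0
                for ticker in current_prices}
    buys = [ticker for ticker, ret in expected.items() if ret > 0]
    if not buys:
        return {ticker: 0.0 for ticker in current_prices}   # stay in cash
    return {ticker: (1.0 / len(buys) if ticker in buys else 0.0)
            for ticker in current_prices}


def rebalance_days(n_days, every):
    """Indices of the days on which the portfolio is rebalanced."""
    return list(range(0, n_days, every))


# Invented tickers and prices, for illustration only.
current = {"AAA": 100.0, "BBB": 50.0, "CCC": 20.0}
predicted = {"AAA": 110.0, "BBB": 45.0, "CCC": 22.0}

weights = target_weights(current, predicted)
schedule = rebalance_days(n_days=130, every=60)
print(weights, schedule)
```

The prediction horizon decides what `predicted` means (price 7, 30, 60, or 120 days out), while the rebalance cycle decides how often `target_weights` is re-evaluated, which is why the two are varied together.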

Observations:

  1. Overall, the portfolio outperforms the S&P 500 from Oct '20 through Jun '21
  2. The algorithm actually never sells any stock once it buys to begin with, but the mix does well enough that substantial drops do not occur
  3. Real Estate, which has performed well as of late, is a significant portion of the portfolio, but this could create some diversification risk if we run into recessionary times similar to the Great Recession
  4. Common market risk metrics, which compare our portfolio against a risk-free investment, are also favorable; this comparison is a common way to assess portfolio health
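The write-up does not name the specific risk metrics used, so as an assumed example, the Sharpe ratio is one common metric that measures return in excess of a risk-free rate per unit of volatility. A minimal sketch, with made-up daily returns and a conventional 252 trading days per year:

```python
import math
import statistics


def sharpe_ratio(daily_returns, annual_risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from simple daily returns."""
    daily_rf = annual_risk_free / periods_per_year
    excess = [r - daily_rf for r in daily_returns]
    spread = statistics.stdev(excess)      # sample standard deviation
    if spread == 0:
        raise ValueError("returns have zero variance")
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)


# Made-up daily portfolio returns.
returns = [0.01, -0.005, 0.02, 0.0, 0.015]

no_rf = sharpe_ratio(returns)
with_rf = sharpe_ratio(returns, annual_risk_free=0.05)
print(f"{no_rf:.2f} vs {with_rf:.2f}")
```

A higher risk-free rate lowers the ratio, since the portfolio has to clear a higher bar before its return counts as compensation for risk.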

Preparing and loading data to Postgres

  1. The output of the portfolio and trading is used to build visualizations in our dashboard
  2. This section is used to prepare data for loading, but the loading itself is done in the /assets/models/tyler_rf_daily_update/Loading_Model_Data_Postgres.ipynb notebook

Training and Building Portfolios Over Multiple Time Horizons

In this section, we will configure multiple buy/sell strategies based on portfolio rebalance timeframes and prediction timeframes. To keep things simple, we will only run one configuration at a time.
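One lightweight way to describe these configurations is a small record per strategy, with a single one selected for each run. The `StrategyConfig` class and the list of rebalance cycles below are assumptions for illustration; only the prediction horizons (7, 30, 60, 120) and the 7-day and 60-day rebalance cycles appear in this write-up.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class StrategyConfig:
    predict_days: int     # how far ahead the model forecasts
    rebalance_days: int   # how often the portfolio is rebalanced

    @property
    def label(self):
        return f"predict{self.predict_days}_rebalance{self.rebalance_days}"


PREDICT_HORIZONS = (7, 30, 60, 120)   # horizons from Part 2
REBALANCE_CYCLES = (7, 30, 60)        # assumed; the 30-day cycle is hypothetical

ALL_CONFIGS = [StrategyConfig(p, r)
               for p, r in product(PREDICT_HORIZONS, REBALANCE_CYCLES)]

# Run one configuration at a time by picking it explicitly.
active = StrategyConfig(predict_days=120, rebalance_days=60)
print(active.label, "of", len(ALL_CONFIGS), "possible configurations")
```

Using the label as a key when saving results makes it easier to compare runs later, for example in a dashboard.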

Part 5: Backtesting

In this next section, we will backtest a couple of high-performing portfolios identified from the dashboard. This approach involves finding a period of greater losses (or stress) for the entire market and seeing how well the strategy does. Because the model has mostly been run during a time of positive market movement, we haven't yet confirmed that it can handle downturns. For the stressed scenario, we will use the period around the start of the COVID-19 pandemic, where losses were "exaggerated".
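A stressed-window comparison could be summarized roughly as follows. The sketch assumes we already have daily portfolio and benchmark values as plain lists; the numbers, the window indices, and the choice of maximum drawdown as the stress metric are illustrative assumptions rather than the notebook's actual backtest.

```python
def cumulative_return(values):
    """Total return from the first to the last value of a series."""
    return values[-1] / values[0] - 1.0


def max_drawdown(values):
    """Largest peak-to-trough decline, as a negative fraction (0.0 if none)."""
    peak = values[0]
    worst = 0.0
    for value in values:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst


def stress_report(portfolio, benchmark, start, end):
    """Compare two value series over the slice [start, end)."""
    p, b = portfolio[start:end], benchmark[start:end]
    return {
        "portfolio_return": cumulative_return(p),
        "benchmark_return": cumulative_return(b),
        "portfolio_max_drawdown": max_drawdown(p),
        "benchmark_max_drawdown": max_drawdown(b),
    }


# Made-up daily values with a sharp drop and partial recovery.
portfolio = [100.0, 105.0, 90.0, 84.0, 95.0, 102.0]
benchmark = [100.0, 104.0, 80.0, 78.0, 90.0, 100.0]

report = stress_report(portfolio, benchmark, start=1, end=6)
for key, value in report.items():
    print(f"{key}: {value:.3f}")
```

Looking at drawdown alongside total return matters here because two series can end a stressed window at similar levels while one of them fell much further along the way.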

Observations:

  1. Even during a stressed scenario, our longer term portfolio (predict 120 days into the future and re-balance every 60 days) is at parity with the overall market. This is most likely due to not aggressively selling during stressed scenarios
  2. One missing scenario that would be helpful to backtest is a longer period of stress, such as the Great Recession. Also, portfolio sector weights play a factor and can change performance in different types of stress scenarios. Both Energy (36%) and Real Estate (11%) have experienced sector-specific stressed scenarios in the past
  3. The 7 and 7 model (predict 7 days into the future and re-balance every 7 days) also does a great job of tracking the market during a downturn, which leads me to believe that gains realized during good times would not be lost during bad times